Submitted by Group 17 (CS 132 WFU)
We first import the necessary modules.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as po
po.init_notebook_mode()
We then import the dataset and store it in a dataframe named dataset. We copy the dataset into tweets so that we can manipulate the data non-destructively.
url = "https://raw.githubusercontent.com/jolyuh/cs132-grp17-portfolio/main/assets/dataset/Combined%20Dataset%20-%20Group%2017.xlsx"
dataset = pd.read_excel(url)
tweets = dataset.copy()
tweets.shape
(150, 35)
Since the original dataset contains columns that are not needed for the data exploration, we drop those columns (namely ID, Group, Collector, Category, Topic, Reviewer, and Review). These columns' purpose is to help the researchers distinguish the samples on a meta level and are not necessary for analysis.
tweets.columns
Index(['ID', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category',
'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio',
'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
'Rating', 'Reasoning', 'Remarks', 'Marcos supporter',
'Duterte supporter', 'Explanation for the political stance', 'Reviewer',
'Review'],
dtype='object')
tweets = tweets.drop(columns=['ID', 'Group', 'Collector', 'Category', 'Topic', 'Reviewer', 'Review'])
tweets.shape
(150, 28)
Below is a summary of the dataframe information. It shows us at a glance the number of non-null entries in the dataframe. Because we have 150 samples, if the number of non-null objects is less than 150, then there are some holes in our data. For our data pre-processing, then, we investigate these holes.
tweets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 28 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Timestamp                             150 non-null    object
 1   Tweet URL                             150 non-null    object
 2   Keywords                              150 non-null    object
 3   Account handle                        150 non-null    object
 4   Account name                          149 non-null    object
 5   Account bio                           114 non-null    object
 6   Account type                          150 non-null    object
 7   Joined                                150 non-null    datetime64[ns]
 8   Following                             150 non-null    int64
 9   Followers                             150 non-null    int64
 10  Location                              74 non-null     object
 11  Tweet                                 150 non-null    object
 12  Tweet Translated                      74 non-null     object
 13  Tweet Type                            150 non-null    object
 14  Date posted                           150 non-null    object
 15  Screenshot                            150 non-null    object
 16  Content type                          149 non-null    object
 17  Likes                                 149 non-null    float64
 18  Replies                               149 non-null    float64
 19  Retweets                              149 non-null    float64
 20  Quote Tweets                          149 non-null    float64
 21  Views                                 4 non-null      object
 22  Rating                                5 non-null      object
 23  Reasoning                             150 non-null    object
 24  Remarks                               122 non-null    object
 25  Marcos supporter                      150 non-null    bool
 26  Duterte supporter                     150 non-null    bool
 27  Explanation for the political stance  149 non-null    object
dtypes: bool(2), datetime64[ns](1), float64(4), int64(2), object(19)
memory usage: 30.9+ KB
# This summarizes the columns that do have null values.
for i in tweets.columns[tweets.isna().any()].tolist():
print(i)
Account name
Account bio
Location
Tweet Translated
Content type
Likes
Replies
Retweets
Quote Tweets
Views
Rating
Remarks
Explanation for the political stance
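Beyond listing the column names, `isna().sum()` gives the per-column null counts, which is what the investigation below is based on. A minimal sketch on toy data (the column names and values here are illustrative, not from the actual dataset):

```python
import pandas as pd

# Toy frame standing in for the tweets dataframe
df = pd.DataFrame({
    "Location": ["PH", None, None],
    "Tweet": ["a", "b", "c"],
})

# Count nulls per column; keep only columns with at least one null
null_counts = df.isna().sum()
print(null_counts[null_counts > 0])
```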
However, while we should be filling in holes where we can, not all values are required or available. For example, a Twitter account is not required to have a Location or an Account bio, which is why those columns have more null values than the others. The column Views is also largely null because the Views feature of tweets only started rolling out in late December 2022, which covers only a small portion of the date range of our data. Thus, for our data clean-up, we only pay attention to null values that represent an actual lack of information needed for our research. These relevant columns are:
- Account name
- Content type
- Likes
- Replies
- Retweets
- Quote Tweets
- Explanation for the political stance

To correct fields containing tweet or Twitter account info, we replace the NaN values with the values currently shown on Twitter. For fields that require our own assessment, we fill them in with our own evaluation.
# tweets[tweets['Account name'].isna()]
# tweets.at[37, "Account name"]
tweets.at[37, 'Account name'] = "Crux of the Matter 🕊🏃‍♀️🦅🏃‍♀️"
# tweets[tweets['Account name'].isna()]
# tweets[tweets['Content type'].isna()]
# tweets.at[27, 'Content type']
tweets.at[27, 'Content type'] = "Emotional"
tweets[tweets['Content type'].isna()]
0 rows × 28 columns (empty DataFrame — Content type no longer has null values)
# If you uncomment this you can see that the sample with NaN value (27)
# is the same across these four characteristics:
# tweets[tweets['Likes'].isna()]
# tweets[tweets['Replies'].isna()]
# tweets[tweets['Retweets'].isna()]
# tweets[tweets['Quote Tweets'].isna()]
# When checking out the original tweet, all four values are 0
tweets.at[27, 'Likes'] = 0
tweets.at[27, 'Replies'] = 0
tweets.at[27, 'Retweets'] = 0
tweets.at[27, 'Quote Tweets'] = 0
tweets[tweets['Quote Tweets'].isna()]
0 rows × 28 columns (empty DataFrame — Quote Tweets no longer has null values)
# tweets[tweets['Explanation for the political stance'].isna()]
tweets.at[25, 'Marcos supporter'] = False
tweets.at[25, 'Duterte supporter'] = True
tweets.at[25, 'Explanation for the political stance'] = "Display name has fist emojis commonly associated with Duterte. Not enough data to show support for Marcos."
Date posted column

During the process of data visualization in a subsequent section, we found that not all of the values in the Date posted column were read by Pandas as Python datetime objects: those in the DD/MM/YY HH:MM format (rather than YYYY-MM-DD HH:MM:SS) were read as str objects instead.
from datetime import datetime
string_dates = tweets[tweets['Date posted'].apply(lambda x: isinstance(x, str))]
datetime_dates = tweets[tweets['Date posted'].apply(lambda x: isinstance(x, datetime))]
print(f"Dates in str format: {string_dates.shape[0]}")
print(f"Dates in datetime format: {datetime_dates.shape[0]}")
Dates in str format: 30
Dates in datetime format: 120
string_dates['Date posted'].head(5)
44    14/05/22 10:31
46    15/04/21 08:51
47    27/01/21 15:34
48    29/10/20 10:45
50    24/02/22 10:56
Name: Date posted, dtype: object
datetime_dates['Date posted'].head(5)
0   2020-08-30 19:30:00
1   2020-08-23 20:12:00
2   2020-07-03 15:26:00
3   2018-03-05 13:16:40
4   2018-03-04 04:17:33
Name: Date posted, dtype: object
To fix this, we replace the original Date posted column with a modified version that creates a datetime object based on the value from the DD/MM/YY HH:MM formatted string.
def get_date_slice(date):
    # 'DD/MM/YY' -> [DD, MM, YY] as ints
    return [int(x) for x in date.split('/')]

def get_time_slice(time):
    # 'HH:MM' -> [HH, MM] as ints
    return [int(x) for x in time.split(':')]

def get_datetime_from_str(date_str):
    # Values already parsed as datetime are returned unchanged
    if isinstance(date_str, datetime):
        return date_str
    date_part, time_part = date_str.split(' ')
    date = get_date_slice(date_part)
    time = get_time_slice(time_part)
    # Two-digit years in this dataset are all post-2000
    return datetime(2000 + date[2], date[1], date[0], time[0], time[1])
tweets['Date posted'] = tweets['Date posted'].map(get_datetime_from_str)
tweets['Date posted']
0 2020-08-30 19:30:00
1 2020-08-23 20:12:00
2 2020-07-03 15:26:00
3 2018-03-05 13:16:40
4 2018-03-04 04:17:33
...
145 2021-02-11 16:48:00
146 2021-10-17 15:57:00
147 2021-10-17 00:43:00
148 2021-06-10 16:54:00
149 2021-05-08 08:03:00
Name: Date posted, Length: 150, dtype: datetime64[ns]
As a result of our processing, all of the values in the Date posted column are now datetime objects.
tweets['Date posted'].apply(lambda x: isinstance(x, datetime)).describe()
count      150
unique       1
top       True
freq       150
Name: Date posted, dtype: object
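The same normalization can also be done more compactly with `datetime.strptime` and an explicit format string. A minimal sketch on a hypothetical mixed column (the sample values below are illustrative):

```python
import pandas as pd
from datetime import datetime

# Hypothetical mixed column: parsed datetimes alongside DD/MM/YY HH:MM strings
mixed = pd.Series([datetime(2020, 8, 30, 19, 30), "14/05/22 10:31"])

# Leave datetime values alone; parse strings with an explicit format
normalized = mixed.map(
    lambda x: x if isinstance(x, datetime)
    else datetime.strptime(x, "%d/%m/%y %H:%M")
)
```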
The only columns that require encoding are Marcos supporter and Duterte supporter.
tweets['Marcos supporter'] = tweets['Marcos supporter'].replace({True: 1, False: 0})
tweets['Marcos supporter']
0 0
1 1
2 0
3 1
4 0
..
145 0
146 1
147 1
148 1
149 1
Name: Marcos supporter, Length: 150, dtype: int64
tweets['Duterte supporter'] = tweets['Duterte supporter'].replace({True: 1, False: 0})
tweets['Duterte supporter']
0 1
1 1
2 1
3 1
4 1
..
145 1
146 1
147 1
148 1
149 1
Name: Duterte supporter, Length: 150, dtype: int64
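Since both columns are already boolean, `astype(int)` achieves the same encoding in one step; a minimal sketch on a toy column:

```python
import pandas as pd

# Toy boolean column standing in for 'Marcos supporter'
flags = pd.Series([True, False, True])

# Cast bool -> int: True becomes 1, False becomes 0
encoded = flags.astype(int)
```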
With our dataset cleaned up, we can look at the distribution of values in the dataset.
tweets.describe()
| Joined | Following | Followers | Date posted | Likes | Replies | Retweets | Quote Tweets | Marcos supporter | Duterte supporter | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 150 | 150.000000 | 150.000000 | 150 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.00000 |
| mean | 2017-03-22 02:33:36 | 724.093333 | 1344.080000 | 2020-12-20 20:05:06.986666752 | 14.026667 | 1.093333 | 4.880000 | 0.906667 | 0.713333 | 0.88000 |
| min | 2006-08-01 00:00:00 | 0.000000 | 0.000000 | 2017-12-01 11:30:00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| 25% | 2014-01-01 00:00:00 | 130.000000 | 64.000000 | 2020-05-31 09:16:30 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.00000 |
| 50% | 2018-01-30 12:00:00 | 327.000000 | 273.000000 | 2020-11-10 00:03:30 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
| 75% | 2020-05-01 00:00:00 | 778.000000 | 882.250000 | 2022-05-17 22:37:45 | 4.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.00000 |
| max | 2022-06-01 00:00:00 | 9381.000000 | 25419.000000 | 2022-12-30 20:35:00 | 531.000000 | 46.000000 | 203.000000 | 41.000000 | 1.000000 | 1.00000 |
| std | NaN | 1253.035516 | 3433.864717 | NaN | 59.543109 | 5.046557 | 22.912416 | 4.849890 | 0.453719 | 0.32605 |
We then visualize a general overview of the tweets in our dataset based on our classification of whether the posters are Marcos and/or Duterte supporters.
marcos = tweets.query("`Marcos supporter` == 1").shape[0]
duterte = tweets.query("`Duterte supporter` == 1").shape[0]
marcos_duterte = tweets.query("`Marcos supporter` == 1 and `Duterte supporter` == 1").shape[0]
marcos_only = tweets.query("`Marcos supporter` == 1 and `Duterte supporter` == 0").shape[0]
duterte_only = tweets.query("`Marcos supporter` == 0 and `Duterte supporter` == 1").shape[0]
neither = tweets.query("`Marcos supporter` == 0 and `Duterte supporter` == 0").shape[0]
total = tweets.shape[0]
pie_data = np.array([marcos_duterte, marcos_only, duterte_only, neither])
pie_labels = [
"Marcos-Duterte",
"Marcos only",
"Duterte only",
"Neither"
]
interactive_pie = go.Pie(labels=pie_labels, values=pie_data, pull=[0, 0, 0, 0.2], title="Posters' Political Leaning")
fig = go.Figure(data=interactive_pie)
fig.show()
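The four pie categories are mutually exclusive and should sum to the total number of tweets; a quick sanity check of that partition on toy data (the five rows below are illustrative):

```python
import pandas as pd

# Toy frame with the two encoded stance columns
df = pd.DataFrame({
    "Marcos supporter": [1, 1, 0, 0, 1],
    "Duterte supporter": [1, 0, 1, 0, 1],
})

both = len(df.query("`Marcos supporter` == 1 and `Duterte supporter` == 1"))
marcos_only = len(df.query("`Marcos supporter` == 1 and `Duterte supporter` == 0"))
duterte_only = len(df.query("`Marcos supporter` == 0 and `Duterte supporter` == 1"))
neither = len(df.query("`Marcos supporter` == 0 and `Duterte supporter` == 0"))

# The four exclusive categories partition the dataset
assert both + marcos_only + duterte_only + neither == len(df)
```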
rational = tweets.query("`Content type` == \"Rational\"").shape[0]
emotional = tweets.query("`Content type` == \"Emotional\"").shape[0]
transactional = tweets.query("`Content type` == \"Transactional\"").shape[0]
content_type_data = np.array([rational, emotional, transactional])
content_type_labels = [
"Rational",
"Emotional",
"Transactional",
]
acct_type_counts = pd.DataFrame({
'Content Type': content_type_labels,
'No. of tweets': content_type_data
})
fig = px.bar(acct_type_counts, x="Content Type", y="No. of tweets", title="Content Type of collected tweets")
fig.show()
It can be seen that the majority of the tweets collected were Emotional, with many of them also being replies to other tweets.
emotional_tweets = tweets.query("`Content type` == 'Emotional'")
reply_count = emotional_tweets[emotional_tweets['Tweet Type'].str.contains('Reply')].shape[0]
print(f"Number of Emotional tweets that are also replies: {reply_count}")
emotional_tweets[['Tweet', 'Content type', 'Tweet Type']]
Number of Emotional tweets that are also replies: 88
| Tweet | Content type | Tweet Type | |
|---|---|---|---|
| 0 | Kayo po pumatay d ang government. Huwag nyo k... | Emotional | Text, Reply |
| 1 | Kawawang kabataan,sinayang ang magandang kinab... | Emotional | Text, Video |
| 2 | Bakit namin kailangan protektahan ang aming mg... | Emotional | Text, Image |
| 3 | @anakbayan_ph\n😂😂😂\nTERORISTA pa more! | Emotional | Text, Image |
| 4 | @anakbayan_mm@ asaan na kayabangan ninyo na ka... | Emotional | Text |
| ... | ... | ... | ... |
| 145 | NPA Mga salot sa lipunan. Tanga nlang mga nani... | Emotional | Text, Reply |
| 146 | Its terrorism...\nAnakbayan is a legal front o... | Emotional | Text, Reply |
| 147 | Kabataan Partylist? Seriously? Isa yan sa mga ... | Emotional | Text, Reply |
| 148 | Kawawang mga NPA at DILAWANSHIT 🤣🤣🤣 | Emotional | Text, Reply |
| 149 | @pnppio @TeamAFP don’t stop the attacks on com... | Emotional | Text, Reply |
113 rows × 3 columns
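The reply count above uses a substring match on the Tweet Type tags; this can be sketched on toy data (the 'Tweet Type' values below are illustrative):

```python
import pandas as pd

# Toy 'Tweet Type' column; values are comma-separated tags
df = pd.DataFrame({
    "Tweet Type": ["Text, Reply", "Text, Video", "Text, Reply", "Text"],
})

# Substring match: rows whose type tags include 'Reply'
reply_count = df[df["Tweet Type"].str.contains("Reply")].shape[0]
```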
def all_quarters():
    # All quarter labels from 2016Q1 through 2022Q4, matching Period string form
    years = [str(x) for x in range(2016, 2023)]
    quarters = ["Q1", "Q2", "Q3", "Q4"]
    return [year + qtr for year in years for qtr in quarters]
quarter_posted = pd.PeriodIndex(tweets["Date posted"], freq='Q')
tweets['Quarter posted'] = quarter_posted
quarter_counts = list(map(lambda qtr: (tweets['Quarter posted']==qtr).sum(), all_quarters()))
# Heatmap of when the collected tweets were posted, grouped by quarter
data = np.array([quarter_counts[i*4:i*4 + 4] for i in range(7)]).T
fig = px.imshow(data,
labels=dict(x="Year", y="Quarter", color="Tweets"),
x=[str(x) for x in range(2016, 2023)],
y=['Q1', 'Q2', 'Q3', 'Q4'],
title="Distribution of 'Date posted' for tweets by quarter"
)
fig.show()
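The quarter binning above relies on pd.PeriodIndex producing quarterly periods that render as 'YYYYQn' and compare equal to the label strings from all_quarters(); a minimal sketch on two toy dates:

```python
import pandas as pd

# Two toy post dates in different quarters
dates = pd.Series(pd.to_datetime(["2020-05-31", "2022-05-17"]))

# Quarterly periods render as 'YYYYQn' and compare against such strings
quarters = pd.PeriodIndex(dates, freq="Q")
count_2020q2 = (quarters == "2020Q2").sum()
```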
By visualizing the distribution of post dates, we hoped to gain additional insight into the context behind increases or decreases in the number of red-tagging tweets posted during certain periods.
From the 150 tweets collected, the greatest number of red-tagging tweets were posted during the 2nd Quarter of 2020 and the 2nd Quarter of 2022.
Though the scope is limited to the 150 tweets collected by the researchers, which could have been affected by biases introduced by Twitter's search algorithm, the surge in numbers coincides with two major events: